Multilingual bottleneck features for subword modeling in zero-resource languages

نویسندگان

  • Enno Hermann
  • Sharon Goldwater
چکیده

How can we effectively develop speech technology for languages where no transcribed data is available? Many existing approaches use no annotated resources at all, yet it makes sense to leverage information from large annotated corpora in other languages, for example in the form of multilingual bottleneck features (BNFs) obtained from a supervised speech recognition system. In this work, we evaluate the benefits of BNFs for subword modeling (feature extraction) in six unseen languages on a word discrimination task. First we establish a strong unsupervised baseline by combining two existing methods: vocal tract length normalisation (VTLN) and the correspondence autoencoder (cAE). We then show that BNFs trained on a single language already beat this baseline; including up to 10 languages results in additional improvements which cannot be matched by just adding more data from a single language. Finally, we show that the cAE can improve further on the BNFs if high-quality same-word pairs are available.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

The 2016 RWTH Keyword Search System for Low-Resource Languages

In this paper we describe the RWTH Aachen keyword search (KWS) system developed in the course of the IARPA Babel program. We put focus on acoustic modeling with neural networks and evaluate the full pipeline with respect to the KWS performance. At the core of this study lie multilingual bottleneck features extracted from a deep neural network trained on all 28 languages available to the project...

متن کامل

Multilingual hierarchical MRASTA features for ASR

Recently, a multilingual Multi Layer Perceptron (MLP) training method was introduced without having to explicitly map the phonetic units of multiple languages to a common set. This paper further investigates this method using bottleneck (BN) tandem connectionist acoustic modeling for four high-resourced languages — English, French, German, and Polish. Aiming at the improvement of already existi...

متن کامل

Language ID-based training of multilingual stacked bottleneck features

In this paper, we explore multilingual feature-level data sharing via Deep Neural Network (DNN) stacked bottleneck features. Given a set of available source languages, we apply language identification to pick the language most similar to the target language, for more efficient use of multilingual resources. Our experiments with IARPA-Babel languages show that bottleneck features trained on the ...

متن کامل

Pronunciation Lexicon Development for Under-Resourced Languages Using Automatically Derived Subword Units: A Case Study on Scottish Gaelic

Developing a phonetic lexicon for a language requires linguistic knowledge as well as human effort, which may not be available, particularly for under-resourced languages. To avoid the need for the linguistic knowledge, acoustic information can be used to automatically obtain the subword units and the associated pronunciations. Towards that, the present paper investigates the potential of a rec...

متن کامل

Multilingual bottleneck features for language recognition

In this paper, we investigate Multilingual Stacked Bottleneck Features (SBN) in language recognition domain. These features are extracted using bottleneck neural networks trained on data from multiple languages. Previous results have shown benefits of multilingual training of SBN feature extractor for speech recognition. Here we focus on its impact on language recognition. We present results ob...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2018